Exploring the Use of Linguistic Features in Domain and Genre Classification
Abstract
The central questions are: How useful is information about part-of-speech frequency for text categorisation? Is it feasible to limit word features to content words for text classification? These questions are examined for 5 domain and 4 genre classification tasks using LIMAS, the German equivalent of the Brown corpus. Because LIMAS is too heterogeneous, neither question can be answered reliably for any of the tasks. However, the results suggest that both questions have to be examined separately for each task at hand, because in some cases the additional information can indeed improve performance.

1 Introduction

The greater the amount of text people can access and have to process, the more important efficient methods for text categorisation become. So far, most research has concentrated on content-based categories. But determining the genre of a text can also be very important, for example when having to distinguish an EU press release on the introduction of the euro from a newspaper commentary on the same topic.

The results of e.g. (Lewis, 1992; Yang and Pedersen, 1997) indicate that for good content classification, we basically need a vector which contains the most relevant words of the text. Using n-grams hardly yields significant improvements, because the dimension of the document representation space increases exponentially. But do word-based vectors also work well for genre detection? Or do we need additional linguistically motivated features to capture the different styles of writing associated with different genres?

In this paper, we present a pilot study based on a set of easily computable linguistic features, namely the frequency of part-of-speech (POS) tags, and a corpus of German, LIMAS (Glas, 1975), which contains a wide range of different genres. LIMAS is described briefly in Sec. 3, while Sections 2 and 4 motivate the choice of features. The text categorisation experiments are described in Sec. 5.
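To make the two feature types concrete, the following is a minimal sketch of how POS-tag frequencies and a content-word filter could be computed from text that has already been tagged. It is an illustration only, not the procedure used in the study: the tagset `TAGSET`, the choice of `CONTENT_TAGS`, and the function names `pos_frequencies` and `content_words` are all assumptions introduced here.

```python
from collections import Counter

# Hypothetical coarse tagset, for illustration only; the tagset used in the
# study is not specified in this excerpt.
TAGSET = ("NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "CONJ")

# Assumption: these tags are treated as marking content words.
CONTENT_TAGS = frozenset({"NOUN", "VERB", "ADJ", "ADV"})


def pos_frequencies(tagged_tokens, tagset=TAGSET):
    """Return the relative frequency of each tag, in the order of `tagset`."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = len(tagged_tokens)
    if total == 0:
        return [0.0] * len(tagset)
    return [counts[tag] / total for tag in tagset]


def content_words(tagged_tokens, content_tags=CONTENT_TAGS):
    """Keep only lower-cased tokens whose tag is in `content_tags`."""
    return [tok.lower() for tok, tag in tagged_tokens if tag in content_tags]


# Toy input: (token, tag) pairs as they might come from a POS tagger.
sample = [("Der", "DET"), ("Euro", "NOUN"), ("kommt", "VERB"), ("bald", "ADV")]

pos_vector = pos_frequencies(sample)
word_features = content_words(sample)
```

A POS-frequency vector of this kind has a fixed, small dimension (the size of the tagset), whereas the word-based representation grows with the vocabulary; the content-word filter is one way to limit the latter.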
2 Linguistic Cues to Genre

2.1 What is genre?

The term "genre" is more frequent in philology and media studies than in mainstream linguistics (Swales, 1990, p.38). When it is not used synonymously with the terms "register" or "style", genre is defined on the basis of non-linguistic criteria. For example, (Biber, 1988) characterises genres in terms of author/speaker purpose, while text types classify texts on the basis of text-internal criteria. Swales phrases this more precisely: Genres are collections of communicative events with shared communicative purposes which can vary in their prototypicality. These communicative purposes are determined by the discourse community which produces and reads texts belonging to a genre.

But how can we extract its communicative purpose from a given text? First of all, we need to define the genres we want to detect. The definitions which were used in this study are summarised in section 3.1. If we assume that the culture-specific conventions which form the basis for assigning a given text to a certain genre are reflected in the style of the text, and if that style can be characterised quantitatively as a tendency to favour certain linguistic options over others (Herdan, 1960), we can then proceed to search for linguistic features which both discriminate well between our genres and can also be computed reliably from unannotated text. Potential sources for such options are comparative genre studies (Biber, 1988), authorship attribution research (Holmes, 1998; Forsyth and Holmes, 1996), content analy-
Publication date: 1999